2. Analysis with GeoPandas¶

ENV 859 - Fall 2022
© John Fay, Duke University

In this notebook, we examine how GeoPandas is used in performing a spatial analysis. We take an example of looking at the demographic characteristics of where electric vehicle (EV) charging stations have been located in Durham, Wake, and Orange counties with respect to 2010 Census and Social Vulnerability Index values. (It's not a very sensible analysis as done here, but we are concentrating on the mechanics more than the utility of the analysis...)

Learning Objectives:¶

  • Executing a "*data science workflow*" with GeoPandas
  • Read data into a geodataframe (CSV and GeoJSON)
  • Explore the data: columns/column types, summaries, plots
  • Analyze the data...
  • Visualize results
  • Subsetting features in a geodataframe by attribute
  • Merging geodataframes
  • Dissolving geodataframe features based on an attribute value
  • Joining attributes to a geodataframe
  • Spatially joining data from one geodataframe to another
  • Generating various plots from single and multiple geodataframes
  • Saving a geodataframe to a feature class

Workflow:¶

  • Part 1: Fetching and Exploring Data
  • 1.1 Import packages
  • 1.2 Read CSV data into geodataframe
  • 1.3 Explore the data in the geodataframe
  • 1.4 Import census data to geodataframe via web service
  • Part 2: Analysis (and Visualization)
  • 2.1 Subset EV features by attribute
  • 2.2 Merge the three county geodataframes into a single geodataframe
  • 2.3 Dissolve block features to the tract level
  • 2.4 Import and join the social vulnerability data to tract features
  • 2.5 Compute population density for each tract
  • 2.6 Subset EV stations spatially
  • 2.7 Spatially join tract attributes to EV features
  • Part 3: Share Results
  • 3.1 Share your notebook as html or on GitHub
  • 3.2 Export your geodataframe(s) as feature classes or CSV files

Part 1: Fetching and Exploring the Data¶

Here we'll gather and explore the data we'll be using in our analysis. This includes two datasets. First is the list of EV Charging locations, stored as a CSV file in our data folder. This dataset has coordinate columns that we'll use to construct points and convert into a geodataframe as we learned in our previous lessons.

The second dataset comprises 2010 Census BlockGroup data for all of North Carolina. We'll fetch these data from an online resource using a web service. We'll revisit web services later; for now, we'll use this process to fetch data for three counties: Durham, Wake, and Orange.

For each dataset, we'll get the data into geodataframe format and then explore the data in various ways. Then we'll move to Part 2, where we analyze the data.

1.1: Import packages needed in the analysis¶

In [1]:
#Import packages
import pandas as pd
import geopandas as gpd

import matplotlib.pyplot as plt
import contextily as ctx
In [2]:
#Command to run if autocomplete is slooooowwww
%config Completer.use_jedi = False

♦ Knowledge Check ♦

→ Can you explain what role each package imported might do in our analysis?

1.2: Create a geodataframe from a CSV file¶

As done in a previous notebook, we want to:

  • Import a CSV file containing coordinate columns into a Pandas dataframe,
  • Create a collection of Shapely points from the coordinate fields, and
  • Create a geodataframe from the components.
In [3]:
#Read in charging stations CSV, convert to geodataframe
df = pd.read_csv('../data/NC_Charging_Stations.csv')

#Create the geoseries from the X and Y columns
geom = gpd.points_from_xy(x=df['Longitude'],y=df['Latitude'])

#Convert to a geodataframe, setting the CRS to WGS 84 (wkid=4326)
gdf_stations_all = gpd.GeoDataFrame(df,geometry=geom,crs=4326)

1.3: Explore the data in the geodataframe¶

Have a quick look at the contents imported. Things to check include:

  • How many rows and columns were imported
  • The names, data types, and number of non-null values in each column
  • Examine a single record from the geodataframe
  • What geometry types are included in the geodataframe?
  • Summary statistics of numeric columns, if applicable
  • Correlations among column values, if applicable
  • Spatial plot of the data
In [4]:
#Show the number of rows and columns
gdf_stations_all.shape
Out[4]:
(1227, 11)
In [6]:
#Examine the structure of the dataframe
gdf_stations_all.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 1227 entries, 0 to 1226
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   ID              1227 non-null   int64   
 1   Fuel Type Code  1227 non-null   object  
 2   Station Name    1227 non-null   object  
 3   City            1227 non-null   object  
 4   State           1227 non-null   object  
 5   ZIP             1227 non-null   object  
 6   Status Code     1227 non-null   object  
 7   Latitude        1227 non-null   float64 
 8   Longitude       1227 non-null   float64 
 9   Facility Type   450 non-null    object  
 10  geometry        1227 non-null   geometry
dtypes: float64(2), geometry(1), int64(1), object(7)
memory usage: 105.6+ KB
In [7]:
#Examine a single record of data
gdf_stations_all.iloc[0]
Out[7]:
ID                                               39016
Fuel Type Code                                    ELEC
Station Name      City of Raleigh - Municipal Building
City                                           Raleigh
State                                               NC
ZIP                                              27601
Status Code                                          E
Latitude                                     35.778416
Longitude                                    -78.64347
Facility Type                           STREET_PARKING
geometry                   POINT (-78.64347 35.778416)
Name: 0, dtype: object
In [9]:
#Reveal the geometry type(s) contained in the geodataframe
gdf_stations_all['geometry'].type
#gdf_stations_all['geometry'].type.unique()
Out[9]:
array(['Point'], dtype=object)
In [11]:
#Plot the data
gdf_stations_all.plot().grid()

♦ Knowledge Check ♦

→ Could you perform the same steps of importing and exploring the USGS gage locations in NC stored in the CSV file ../data/gages.csv?

  • Convert data to a geodataframe: gdf_gages
  • Explore the data
  • Plot the gage sites
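The steps above can be sketched as a reusable helper, shown here with a minimal example. (The column names 'Longitude' and 'Latitude' in gages.csv are an assumption; check the actual file with df.columns before running.)

```python
import pandas as pd
import geopandas as gpd

def csv_to_gdf(csv_path, x_col='Longitude', y_col='Latitude', epsg=4326):
    """Read a CSV with coordinate columns into a point geodataframe."""
    #Read the raw CSV into a Pandas dataframe
    df = pd.read_csv(csv_path)
    #Build point geometries from the coordinate columns
    geom = gpd.points_from_xy(x=df[x_col], y=df[y_col])
    #Assemble the geodataframe, tagging it with the supplied CRS
    return gpd.GeoDataFrame(df, geometry=geom, crs=epsg)

#gdf_gages = csv_to_gdf('../data/gages.csv')  #column names assumed
#gdf_gages.info()
#gdf_gages.plot()
```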

1.4. Import NC Census Block Group features via NC OneMap's web service¶

We will explore web services a bit later, but we'll use the code below to acquire polygon data of census block groups for Durham, Wake, and Orange counties from an NC OneMap Web Service. Once imported, we'll merge these geodataframes together and use them in our subsequent analyses.

  • First, to simplify matters, I've created a Python function to fetch data for a specific county given its FIPS code.

(We'll examine how this works a bit later...)

In [12]:
#Create a function to read NCOneMap feature services into a geodataframe
def getBlockGroupData(FIPS):
    #Construct the url from the function arguments
    url=f'https://services.nconemap.gov/secure/rest/services/NC1Map_Census/FeatureServer/8/query?' + \
        f"where=GEOID10+LIKE+'{FIPS}%'&outFields=GEOID10,TOTAL_POP&f=geojson"
    
    #Create a geodataframe from the URL
    gdf = gpd.read_file(url)
    
    #Return the geodataframe
    return gdf
  • Now, we apply that function for the three counties we want to examine
In [14]:
#Fetch census block groups for Durham, Orange, and Wake counties using the above function
gdf_DurmBlkGroups = getBlockGroupData(37063)
gdf_WakeBlkGroups = getBlockGroupData(37183)
gdf_OrangeBlkGroups = getBlockGroupData(37135)
  • Challenge: See if you can fetch Chatham county block groups (FIPS = 37037)
In [13]:
#Challenge: See if you can fetch Chatham county block groups (FIPS = 37037)
gdf_Chatham = getBlockGroupData(37037)
  • Explore the data...
  • What is its coordinate reference system?
  • What columns are included?
  • What does the first record look like?
In [16]:
#Show the Durham block group geodataframe's coordinate reference system
gdf_DurmBlkGroups.crs
Out[16]:
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
In [17]:
#Explore the Durham block group geodataframe's columns...
gdf_DurmBlkGroups.columns
Out[17]:
Index(['geoid10', 'total_pop', 'geometry'], dtype='object')
In [18]:
#Examine a sample record from the geodataframe
gdf_DurmBlkGroups.iloc[10]
Out[18]:
geoid10                                           370630007002
total_pop                                                 1030
geometry     POLYGON ((-78.9175097042526 35.97670311102937,...
Name: 10, dtype: object
In [19]:
gdf_DurmBlkGroups.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   geoid10    153 non-null    object  
 1   total_pop  153 non-null    int64   
 2   geometry   153 non-null    geometry
dtypes: geometry(1), int64(1), object(1)
memory usage: 3.7+ KB
  • Visualize the data...
In [23]:
#Plot Durham's population, setting the colormap to "magma"
gdf_DurmBlkGroups.plot(column='total_pop',cmap='magma')
Out[23]:
<AxesSubplot: >
In [29]:
#Plot the block groups for all three counties

#First, plot one geodataframe, saving the output as "thePlot"
thePlot = gdf_DurmBlkGroups.plot(color='Blue')

#Plot another geodataframe, telling it to use the axes in "thePlot" created above
gdf_WakeBlkGroups.plot(ax=thePlot,color='Red')

#Repeat for the third geodataframe
gdf_OrangeBlkGroups.plot(ax=thePlot,color="Lightblue")
Out[29]:
<AxesSubplot: >

Part 2: The Analysis (and Visualization)¶

Now that we've obtained a few datasets and gotten them into geodataframes, we'll perform some analyses. These include:

  • Subsetting the EV charging stations for those in specific cities.
  • Identifying the census blocks surrounding each EV station, within a distance of 1km
  • To do this, we'll merge the Durham, Wake, and Orange Co block data selected above
  • Then we'll buffer our selected EV station points a distance of 1km
  • And finally, we'll select blocks that intersect the EV station buffers
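The buffering step mentioned above can be sketched as follows. (A toy point in UTM Zone 17N, EPSG:32617, is assumed; buffer distances are expressed in the CRS's units, so the data must be in a projected coordinate system before buffering.)

```python
import geopandas as gpd
from shapely.geometry import Point

#Buffer distances use the CRS's units, so start from a projected CRS (meters)
pts = gpd.GeoDataFrame(geometry=[Point(713000, 3961934)], crs=32617)

#Buffer each point by 1000 m (1 km); the result is a set of polygon geometries
buffers = pts.copy()
buffers['geometry'] = pts.buffer(1000)
```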

2.1: Subset the EV Station points based on attribute values¶

Doc: https://geopandas.org/indexing.html

Subsetting features in a geodataframe uses the same methods as subsetting records in a Pandas dataframe. Here we'll run through an example by subsetting EV stations found only within Durham, Raleigh, and Chapel Hill.

  • 2.1.1 Examine unique values in the City column
In [30]:
#Reveal the unique values in the City column
gdf_stations_all['City'].unique()
Out[30]:
array(['Raleigh', 'Concord', 'Fayetteville', 'High Point', 'Lumberton',
       'Durham', 'Charlotte', 'Sanford', 'Burlington', 'Asheville',
       'Forest City', 'Gastonia', 'Goldsboro', 'Greensboro', 'Greenville',
       'Hendersonville', 'Hickory', 'Jacksonville', 'Montreat',
       'New Bern', 'Reidsville', 'Roanoke Rapids', 'Rockingham',
       'Mt. Airy', 'Salisbury', 'Southern Pines', 'Statesville',
       'Wake Forest', 'Wilkesboro', 'Wilmington', 'Wilson',
       'Winston-Salem', 'Clyde', 'Cornelius', 'Elizabeth City',
       'Asheboro', 'Black Mountain', 'Boone', 'Cary', 'Chapel Hill',
       'Hillsborough', 'Cherokee', 'Flat Rock', 'Clinton', 'Knightdale',
       'Mt. Holly', 'Mooresville', 'Weaverville', 'Fletcher', 'Pittsboro',
       'Belmont', 'Lowell', 'Dallas', 'Lincolnton', 'Arden',
       'Kernsersville', 'Mebane', 'Point Harbor', 'Nags Head',
       'Huntersville', 'Ocean Isle Beach', 'Holden Beach', 'Lenoir',
       'Waynesville', 'Maggie Valley', 'Granite Falls', 'Hot Springs',
       'West Jefferson', 'Cullowhee', 'Graham', 'Indian Trail',
       'Kings Mountain', 'Old Fort', 'Henderson', 'Garner', 'Roxboro',
       'Castle Hayne', 'Star', 'Carrboro', 'Brevard', 'Archdale', 'Sylva',
       'Dillsboro', 'Kinston', 'Oriental', 'Monroe', 'Pembroke',
       'Lake Lure', 'Calabash', 'Shallotte', 'Research Triangle Park',
       'Cherokee,', 'Gibsonville', 'Elkin', 'Dobson', 'Davidson',
       'Franklin', 'Rocky Mount', 'Wallace', 'Hatteras', 'Mills River',
       'Banner Elk', 'Beaufort', 'Benson', 'Blowing Rock', 'Buxton',
       'Cape Carteret', 'Cashiers', 'Duck', 'Elon', 'Highlands',
       'Kannapolis', 'Kill Devil Hills', 'Manteo', 'Murphy', 'Newton',
       'Pinehurst', 'Robbinsville', 'Saluda', 'Selma', 'Southport',
       'Tryon', 'Whitsett', 'Lexington', 'Shelby', 'Burgaw', 'Matthews',
       'Plymouth', 'Madison', 'Avon', 'Fuquay Varina', 'Moncure',
       'Mt Airy', 'King', 'Smithfield', 'Dunn', 'Lilesville', 'Raeford',
       'Snow Hill', 'Evergreen', 'Edenton', 'Dudley', 'Sneads Ferry',
       'Aulander', 'Mcleansville', 'Lillington', 'Sunset Beach',
       'Bessemer City', 'Leland', 'Conover', 'Emerald Isle', 'Swansboro',
       'Aberdeen', 'Marion', 'Enfield', 'Nashville', 'Warrenton',
       'Powells Point', 'Hamptonville', 'Winterville', 'Linville', 'Apex',
       'Cherry Point', 'Morrisville', 'Butner', 'SOUTHPORT', 'Haw River',
       'Colfax', 'Halifax', 'Mocksville', 'Albemarle', 'Sparta',
       'Troutman', 'Hampstead', 'Siler City', 'Tarboro', 'Sugar Grove',
       'Cramerton', 'Hope Mills', 'Kitty Hawk', 'Robbins', 'Harrisburg',
       'Spindale', 'Yadkinville', 'Pilot Mountain', 'Ellerbe',
       'Burnsville', 'Scotland Neck', 'Rose Hill', 'Elizabethtown',
       'Warsaw', 'Waxhaw', 'Pisgah Forest', 'Sugar Mountain', 'Littleton',
       'Mars Hill', 'Marshville', 'Holly Springs', 'Louisburg',
       'Oak Island', 'Washington', 'Bakersville', 'Windsor', 'Candler',
       'Rich Square', 'Thomasville', 'Jackson', 'Troy', 'Morehead City',
       'Kernersville', 'Oxford', 'Corolla', 'Rolesville', 'Whiteville',
       'Jamestown', 'Mint Hill'], dtype=object)
  • 2.1.2 Subset records for those where the City is "Durham", "Raleigh", or "Chapel Hill"
In [31]:
#Subset records where the City is "Durham", "Raleigh", or "Chapel Hill"
gdf_stations = gdf_stations_all.query('City in ("Raleigh", "Durham", "Chapel Hill")')
gdf_stations.shape
Out[31]:
(205, 11)
In [32]:
#Recall, an alternative is to use masks...
gdf_stations = gdf_stations_all.loc[
    (gdf_stations_all['City'] == 'Raleigh')|
    (gdf_stations_all['City'] == 'Durham')|
    (gdf_stations_all['City'] == 'Chapel Hill')] 
gdf_stations.shape
Out[32]:
(205, 11)
In [35]:
#Another, more efficient mask using `isin`
gdf_stations = gdf_stations_all.loc[gdf_stations_all['City'].isin(("Raleigh", "Durham", "Chapel Hill"))]
gdf_stations.shape
Out[35]:
(205, 11)
  • 2.1.3 Explore the results...
In [37]:
#Plot the results
myPlot = gdf_DurmBlkGroups.plot()
gdf_stations.plot("City",ax=myPlot);
In [38]:
#Plot them with a base map (using Contextily - more later...)
city_plot = gdf_stations.to_crs(3857).plot(column="City",figsize=(10,10),alpha=0.6)
ctx.add_basemap(city_plot)

2.2: Merge the 3 county block group geodataframes into one¶

Doc: https://geopandas.org/mergingdata.html

  1. Check that all data have the same crs
  2. Optionally, add a field to identify the source geodataframe
  3. Apply the append() function
  4. Check/explore the result

We'll start by appending the Wake Co. dataset to the Durham Co. one. Then you will append the Orange Co. dataframe to that product.

  • 2.2.1 Check that the two files share the same coordinate reference system

→ If they were different, we could use to_crs() to transform one dataset to use the crs of the other

In [39]:
#Check the crs of the two geodataframes
gdf_DurmBlkGroups.crs == gdf_WakeBlkGroups.crs
Out[39]:
True
  • 2.2.2 Add an identifying column to the source geodataframes
In [55]:
#Add a field to each input, setting values to identify the source dataset
gdf_DurmBlkGroups['County'] = "Durham"
gdf_WakeBlkGroups['County'] = "Wake"
  • 2.2.3 Append one dataframe to the other
In [56]:
#Append the Wake Co features to the Durham Co features,
gdf_BlkGrp_step1 = gdf_DurmBlkGroups.append(gdf_WakeBlkGroups)
C:\Users\jo158\AppData\Local\Temp\ipykernel_9068\4223542252.py:2: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  gdf_BlkGrp_step1 = gdf_DurmBlkGroups.append(gdf_WakeBlkGroups)
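As the warning notes, DataFrame.append() is deprecated; pandas.concat() is its replacement. A minimal sketch with toy dataframes (the column names mirror, but are not taken from, the block group data):

```python
import pandas as pd

df_durham = pd.DataFrame({'geoid10': ['370630003023'], 'total_pop': [1533], 'County': ['Durham']})
df_wake = pd.DataFrame({'geoid10': ['371830050011'], 'total_pop': [2100], 'County': ['Wake']})

#Stack the rows of both frames; ignore_index=True rebuilds a clean 0..n-1 index
df_merged = pd.concat([df_durham, df_wake], ignore_index=True)

#The same call works on geodataframes, e.g.:
#gdf_BlkGrp_step1 = pd.concat([gdf_DurmBlkGroups, gdf_WakeBlkGroups], ignore_index=True)
```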
  • 2.2.4 Explore the result
In [57]:
#Check to see that the total rows in the merged gdf match the sum of the two component gdfs
gdf_DurmBlkGroups.shape[0] + gdf_WakeBlkGroups.shape[0] == gdf_BlkGrp_step1.shape[0]
Out[57]:
True
In [58]:
#Plot the result
gdf_BlkGrp_step1.plot()
Out[58]:
<AxesSubplot: >

♦ TASK ♦

Append the Orange Co blockgroup features to the gdf_BlkGrp_step1 geodataframe we just created.

Remember to:

  • check that the coordinate reference systems are the same, and
  • add a new column to the gdf_OrangeBlkGroups, setting its value to the County name.

→ Save the result as gdf_BlkGrp

In [59]:
#Check that the coordinate reference systems are the same
gdf_BlkGrp_step1.crs == gdf_OrangeBlkGroups.crs
Out[59]:
True
In [62]:
#Add the county field
gdf_OrangeBlkGroups['County'] = "Orange"
In [67]:
#Append the geodataframes
gdf_BlkGrp = gdf_BlkGrp_step1.append(gdf_OrangeBlkGroups)
C:\Users\jo158\AppData\Local\Temp\ipykernel_9068\3951118901.py:2: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  gdf_BlkGrp = gdf_BlkGrp_step1.append(gdf_OrangeBlkGroups)
In [68]:
#Plot the output
gdf_BlkGrp.plot(column='County')
Out[68]:
<AxesSubplot: >

2.3: Dissolve block features to the tract level¶

We have Social Vulnerability Data to examine in our analysis, but these data are at the Tract, not BlockGroup level. Thus, to join these attributes to our geodataframe, we'll need to aggregate our blockgroups to the tract level. Fortunately, the GEOID10 attribute is structured such that the census tract is just the first 11 characters. So we will create a new column holding these first 11 characters, and then we'll dissolve our blockgroup features sharing the same tract ID to single features.
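The tract ID extraction described above can be sketched with pandas' string accessor (toy GEOID10 values shown):

```python
import pandas as pd

#Block group GEOIDs: state (2) + county (3) + tract (6) + block group (1) digits
geoids = pd.Series(['370630003023', '371830050011'])

#The first 11 characters identify the census tract
tract_ids = geoids.str[0:11]
```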

Doc: https://geopandas.org/aggregation_with_dissolve.html

  • First, create a new column listing tract IDs (the first 11 digits of the GEOID10)
  • Dissolve the features on this attribute, computing aggregate sum of the TOTAL_POP field
  • 2.3.1 Create the Tract column from the GEOID10 values
In [69]:
gdf_BlkGrp.iloc[0]
Out[69]:
geoid10                                           370630003023
total_pop                                                 1533
geometry     POLYGON ((-78.89598269934186 36.00892711836401...
County                                                  Durham
Name: 0, dtype: object
In [70]:
#Create the Tract column
gdf_BlkGrp['TRACT'] = gdf_BlkGrp['geoid10'].str[0:11]
gdf_BlkGrp.head()
Out[70]:
geoid10 total_pop geometry County TRACT
0 370630003023 1533 POLYGON ((-78.89598 36.00893, -78.89597 36.010... Durham 37063000302
1 370630003021 779 POLYGON ((-78.89606 36.01699, -78.89708 36.017... Durham 37063000302
2 370630003022 1114 POLYGON ((-78.90180 36.01289, -78.90217 36.012... Durham 37063000302
3 370630005003 1289 POLYGON ((-78.92329 35.99672, -78.92398 35.995... Durham 37063000500
4 370630004011 898 POLYGON ((-78.93610 36.02206, -78.93612 36.021... Durham 37063000401
  • 2.3.2 Dissolve features on the Tract column
In [72]:
#Dissolve features on tract, computing summed population
gdf_Tract = gdf_BlkGrp.dissolve(by='TRACT',aggfunc={'total_pop':'sum',
                                                   'County':'first'})

#Show the results
gdf_Tract.head()
Out[72]:
geometry total_pop County
TRACT
37063000101 POLYGON ((-78.87411 36.02349, -78.87417 36.023... 3152 Durham
37063000102 POLYGON ((-78.89264 36.01976, -78.89268 36.018... 4535 Durham
37063000200 POLYGON ((-78.88339 36.00374, -78.88394 36.003... 2946 Durham
37063000301 POLYGON ((-78.91097 36.01230, -78.91102 36.011... 2504 Durham
37063000302 POLYGON ((-78.89598 36.00893, -78.89632 36.008... 3426 Durham
In [75]:
#Plot the data
gdf_Tract.plot(column='total_pop');

2.4: Import the Social Vulnerability Data and join with the Tract features¶

Now that we have the data at the tract level, we can join the Social Vulnerability Index data, stored in a CSV file (./data/NC_SVI_2018.csv).

Doc: https://geopandas.org/mergingdata.html#attribute-joins

  • Import the SVI data as a Pandas dataframe
  • Append records from the SVI dataframe to the Tracts geodataframe
  • 2.4.1 Import and explore the SVI data into a Pandas dataframe
In [79]:
#Import and explore the SVI data
df_SVI = pd.read_csv('../data/NC_SVI_2018.csv', dtype={'FIPS':'str',
                                                       'ST':'str',
                                                       'STCNTY':'str'})

Challenge:
→ Why must 'FIPS', 'ST', and 'STCNTY' be imported as strings via the dtype argument? (Hint: consider what happens to leading zeros when these codes are read as integers.)

In [81]:
#Plot a histogram of the SVI values
df_SVI['SVI'].hist();

Whoops!!!

Values should be between 0 and 1, but we see in the histogram that a few values are down near -1000. It turns out a few records have SVI values of -999 (a no-data flag). We need to remove those records.
In [83]:
#Create a mask of values greater than or equal to zero
valid_mask = df_SVI['SVI'] >= 0
#Apply that mask
df_SVI_fixed = df_SVI.loc[valid_mask]
In [84]:
#View the histogram again
df_SVI_fixed['SVI'].hist()
Out[84]:
<AxesSubplot: >

Phew! Exploring the data paid off!

  • 2.4.2 Append the dataframe to the tracts
In [ ]:
#Have a look at the merge command syntax
gdf_Tract.merge?
In [85]:
#Join the SVI data to the tract features
gdf_Tract_joined = gdf_Tract.merge(df_SVI_fixed, 
                                   left_on='TRACT', 
                                   right_on='FIPS', 
                                   how='left')

#Examine the output
gdf_Tract_joined.head()
Out[85]:
geometry total_pop County ST STATE STCNTY COUNTY FIPS SVI
0 POLYGON ((-78.87411 36.02349, -78.87417 36.023... 3152 Durham 37 NORTH CAROLINA 37063 Durham 37063000101 0.8080
1 POLYGON ((-78.89264 36.01976, -78.89268 36.018... 4535 Durham 37 NORTH CAROLINA 37063 Durham 37063000102 0.8547
2 POLYGON ((-78.88339 36.00374, -78.88394 36.003... 2946 Durham 37 NORTH CAROLINA 37063 Durham 37063000200 0.5706
3 POLYGON ((-78.91097 36.01230, -78.91102 36.011... 2504 Durham 37 NORTH CAROLINA 37063 Durham 37063000301 0.7048
4 POLYGON ((-78.89598 36.00893, -78.89632 36.008... 3426 Durham 37 NORTH CAROLINA 37063 Durham 37063000302 0.3392
In [86]:
#Explore the output, looking for null values
gdf_Tract_joined.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 275 entries, 0 to 274
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   geometry   275 non-null    geometry
 1   total_pop  275 non-null    int64   
 2   County     275 non-null    object  
 3   ST         270 non-null    object  
 4   STATE      270 non-null    object  
 5   STCNTY     270 non-null    object  
 6   COUNTY     270 non-null    object  
 7   FIPS       270 non-null    object  
 8   SVI        270 non-null    float64 
dtypes: float64(1), geometry(1), int64(1), object(6)
memory usage: 21.5+ KB
In [88]:
#Plot the output
gdf_Tract_joined.plot(column='SVI')
Out[88]:
<AxesSubplot: >

Looks like we have some features missing SVI data. Let's examine those more closely.

In [89]:
#Create a mask of null SVI values
null_mask = gdf_Tract_joined['SVI'].isnull()
null_mask.sum() # gives the number of null values
Out[89]:
5
In [90]:
#Apply the mask
gdf_Tract_joined.loc[null_mask]
Out[90]:
geometry total_pop County ST STATE STCNTY COUNTY FIPS SVI
20 POLYGON ((-78.91595 36.00978, -78.91718 36.009... 1941 Durham NaN NaN NaN NaN NaN NaN
59 POLYGON ((-78.88307 35.88604, -78.88532 35.886... 4 Durham NaN NaN NaN NaN NaN NaN
79 POLYGON ((-79.04624 35.91339, -79.04606 35.913... 2350 Orange NaN NaN NaN NaN NaN NaN
273 POLYGON ((-78.80909 35.88187, -78.80944 35.881... 2 Wake NaN NaN NaN NaN NaN NaN
274 POLYGON ((-78.74827 35.87112, -78.74843 35.871... 30 Wake NaN NaN NaN NaN NaN NaN

We can either assign a value to these missing values or leave them as no data. We'll just leave them blank for now...
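If we did later want to fill the missing SVI values rather than leave them null, pandas' fillna() is one option. A sketch with toy values (filling with the column median is just one defensible choice, not a recommendation):

```python
import pandas as pd

svi = pd.Series([0.81, None, 0.34, None, 0.57])

#Option 1: fill nulls with a summary statistic such as the median
svi_filled = svi.fillna(svi.median())

#Option 2: drop the rows with null SVI entirely, e.g.:
#gdf_Tract_joined = gdf_Tract_joined.dropna(subset=['SVI'])
```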

2.5: Compute Population Density for Each Tract¶

Our combined dataframes have a field indicating the total population in each block group. We want to compute population density from this and from the area of each tract. We don't yet have an area field in our dataframe, but we can compute that from our spatial features. But before we can do this, we need to transform our data into a projected coordinate system. So... the steps for this analysis include:

  • Transform the dataframe to a projected coordinate system (UTM Zone 17N)
  • Compute a new Area_km2 column in our dataframe
  • Compute a new PopDens column in our dataframe by dividing total_pop by Area_km2
  • 2.5.1 Transform the dataframe to a projected coordinate system (UTM Zone 17N)
In [91]:
#Project the data to UTM Zone 17N (EPSG 32617)
gdf_Tract_utm = gdf_Tract_joined.to_crs(32617)
In [92]:
gdf_Tract_utm.head()
Out[92]:
geometry total_pop County ST STATE STCNTY COUNTY FIPS SVI
0 POLYGON ((691558.001 3988644.329, 691553.732 3... 3152 Durham 37 NORTH CAROLINA 37063 Durham 37063000101 0.8080
1 POLYGON ((689897.492 3988193.797, 689896.525 3... 4535 Durham 37 NORTH CAROLINA 37063 Durham 37063000102 0.8547
2 POLYGON ((690769.570 3986435.083, 690720.109 3... 2946 Durham 37 NORTH CAROLINA 37063 Durham 37063000200 0.5706
3 POLYGON ((688263.171 3987331.051, 688261.050 3... 2504 Durham 37 NORTH CAROLINA 37063 Durham 37063000301 0.7048
4 POLYGON ((689621.976 3986986.040, 689591.484 3... 3426 Durham 37 NORTH CAROLINA 37063 Durham 37063000302 0.3392
  • 2.5.2 Compute a new Area_km2 column in our dataframe
In [93]:
#Compute a new column of geometry area (in sq km)
gdf_Tract_utm['Area_km2'] = gdf_Tract_utm['geometry'].area/1000000
  • 2.5.3 Compute a new PopDens column in our dataframe by dividing total_pop by Area_km2
In [94]:
#Compute a new column of population density
gdf_Tract_utm['PopDens'] = gdf_Tract_utm['total_pop']/gdf_Tract_utm['Area_km2']
  • 2.5.4 Explore the results
In [102]:
#Plot the distribution of areas
#gdf_Tract_utm['Area_km2'].hist(bins=20)
gdf_Tract_utm['PopDens'].hist(bins=20)
Out[102]:
<AxesSubplot: >
In [97]:
#Plot a map of population density
gdf_Tract_utm.plot(column='PopDens');
  • 2.5.5 Log transform the data to improve the visualization
In [103]:
#Log transform the pop_dens data
import numpy as np
gdf_Tract_utm['PD_log'] = np.log(gdf_Tract_utm['PopDens'])
In [104]:
#Plot the log-transformed distribution of areas
gdf_Tract_utm['PD_log'].hist(bins=20)
Out[104]:
<AxesSubplot: >
In [106]:
#Plot a map of log-transformed population density
gdf_Tract_utm.plot(column='PD_log');

2.6: Subset EV stations spatially¶

Doc: https://geopandas.org/set_operations.html
Previously, we subset EV stations by an attribute (City). Here we'll see how we can instead select features spatially. We do this with GeoPandas' overlay operations.

To spatially select features:

  • Ensure both datasets share the same coordinate reference system; transform if needed
  • Use the overlay intersection operation to select EV features that overlap with the Tracts dataset
  • Examine the outputs
  • 2.6.1 Ensure both datasets share the same crs; transform if not
In [110]:
#Ensure both datasets share the same crs
#(Note: gdf_Tract_utm.crs == gdf_stations_all.crs returns False here)
gdf_Tract_utm.crs.to_string(),gdf_stations_all.crs.to_string()
Out[110]:
('EPSG:32617', 'EPSG:4326')
In [111]:
#Project one dataset to match the other
gdf_stations_all_utm = gdf_stations_all.to_crs(gdf_Tract_utm.crs)
In [115]:
#plot points on tracts
myplot = gdf_Tract_utm.plot(figsize=(12,6),alpha=0.6,color='gray')
gdf_stations_all_utm.plot(ax=myplot);
  • 2.6.2 Select EV stations that intersect the county features
In [118]:
#Intersect the two dataframes
gdf_stations_select = gpd.overlay(df1 = gdf_stations_all_utm,
                                  df2 = gdf_Tract_utm,
                                  how = 'intersection')
  • 2.6.3 Examine the results
In [119]:
#View the data
gdf_stations_select.head()
Out[119]:
ID Fuel Type Code Station Name City State ZIP Status Code Latitude Longitude Facility Type ... ST STATE STCNTY COUNTY FIPS SVI Area_km2 PopDens PD_log geometry
0 39016 ELEC City of Raleigh - Municipal Building Raleigh NC 27601 E 35.778416 -78.643470 STREET_PARKING ... 37 NORTH CAROLINA 37183 Wake 37183050100 0.3077 2.145192 1315.96601 7.182326 POINT (713000.165 3961933.778)
1 39017 ELEC City of Raleigh - Downtown Raleigh NC 27601 E 35.774350 -78.642287 STREET_PARKING ... 37 NORTH CAROLINA 37183 Wake 37183050100 0.3077 2.145192 1315.96601 7.182326 POINT (713117.970 3961485.268)
2 40751 ELEC City of Raleigh - Wilmington Station Deck Raleigh NC 27601 E 35.779370 -78.638203 PAY_GARAGE ... 37 NORTH CAROLINA 37183 Wake 37183050100 0.3077 2.145192 1315.96601 7.182326 POINT (713473.768 3962051.084)
3 42396 ELEC Raleigh Municipal Building Deck Raleigh NC 27601 E 35.779082 -78.641779 MUNI_GOV ... 37 NORTH CAROLINA 37183 Wake 37183050100 0.3077 2.145192 1315.96601 7.182326 POINT (713151.258 3962011.344)
4 42397 ELEC City of Raleigh Raleigh NC 27601 T 35.777283 -78.639281 STREET_PARKING ... 37 NORTH CAROLINA 37183 Wake 37183050100 0.3077 2.145192 1315.96601 7.182326 POINT (713381.895 3961817.202)

5 rows × 22 columns

In [121]:
#Plot the data

#Construct the first (lowest layer) of the data to plot, saving it as "the_plot"
the_plot = gdf_Tract_utm.plot(
    color='lightgrey', #Set the fill of the Tract features
    edgecolor='grey',  #Set the edge color...
    alpha=0.4,         #Set the transparency...
    figsize=(12,12))   #Set the size of the figure

#Plot the stations, setting them to share the axes of "the_plot"
gdf_stations_select.plot(
    ax=the_plot,        #Draw it on the plot created above
    column='SVI',      #Color the features by values in the SVI column
    markersize=45,      #Set the size of the markers
    edgecolor='white'); #Set the edge color of the markers

#Use Contextily to add a nice backdrop
ctx.add_basemap(
    ax = the_plot,         #Add it to our existing plot 
    crs=gdf_Tract_utm.crs, #Transform the background to match the data's crs
    source=ctx.providers.CartoDB.Voyager) #Set the type of backdrop

2.7: Spatially join tract attributes to EV features¶

Doc: https://geopandas.org/mergingdata.html#spatial-joins

Now that we have a proper subset of EV stations, let's add demographic data to our EV locations by performing a spatial join with the tract geodataframe.

  • 2.7.1 Ensure the datasets involved share the same coordinate reference system

→ What would happen if we made a typo and only included a single equals sign?

In [122]:
#Compare crs
gdf_stations_select.crs == gdf_Tract_utm.crs
Out[122]:
True
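To answer the Knowledge Check above: a single equals sign is an assignment, not a comparison. Assigning one geodataframe's crs to the other would relabel the CRS without reprojecting any coordinates, silently corrupting the data. A sketch with toy points (set_crs with allow_override=True is used here to make the relabeling explicit):

```python
import geopandas as gpd
from shapely.geometry import Point

gdf_a = gpd.GeoDataFrame(geometry=[Point(-78.64, 35.78)], crs=4326)    #geographic
gdf_b = gpd.GeoDataFrame(geometry=[Point(713000, 3961934)], crs=32617) #projected

#Two equals signs: a comparison that returns a boolean
crs_match = (gdf_a.crs == gdf_b.crs)   #False here

#Relabeling the CRS (what a one-equals-sign typo would effectively do) only
#changes the label; the coordinates themselves are untouched.
#Use to_crs() when you actually want to reproject.
gdf_bad = gdf_a.set_crs(gdf_b.crs, allow_override=True)
```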
  • 2.7.2 Execute the spatial join
In [123]:
#Execute the spatial join
gdf_JoinedData = gpd.sjoin(left_df = gdf_stations_select,
                          right_df = gdf_Tract_utm,
                          how = 'left',
                          lsuffix = 'ev',
                          rsuffix = 'tracts')
  • 2.7.3 Explore the result
In [124]:
#View the data
gdf_JoinedData.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 313 entries, 0 to 312
Data columns (total 34 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   ID                313 non-null    int64   
 1   Fuel Type Code    313 non-null    object  
 2   Station Name      313 non-null    object  
 3   City              313 non-null    object  
 4   State             313 non-null    object  
 5   ZIP               313 non-null    object  
 6   Status Code       313 non-null    object  
 7   Latitude          313 non-null    float64 
 8   Longitude         313 non-null    float64 
 9   Facility Type     106 non-null    object  
 10  total_pop_ev      313 non-null    int64   
 11  County_ev         313 non-null    object  
 12  ST_ev             305 non-null    object  
 13  STATE_ev          305 non-null    object  
 14  STCNTY_ev         305 non-null    object  
 15  COUNTY_ev         305 non-null    object  
 16  FIPS_ev           305 non-null    object  
 17  SVI_ev            305 non-null    float64 
 18  Area_km2_ev       313 non-null    float64 
 19  PopDens_ev        313 non-null    float64 
 20  PD_log_ev         313 non-null    float64 
 21  geometry          313 non-null    geometry
 22  index_tracts      313 non-null    int64   
 23  total_pop_tracts  313 non-null    int64   
 24  County_tracts     313 non-null    object  
 25  ST_tracts         305 non-null    object  
 26  STATE_tracts      305 non-null    object  
 27  STCNTY_tracts     305 non-null    object  
 28  COUNTY_tracts     305 non-null    object  
 29  FIPS_tracts       305 non-null    object  
 30  SVI_tracts        305 non-null    float64 
 31  Area_km2_tracts   313 non-null    float64 
 32  PopDens_tracts    313 non-null    float64 
 33  PD_log_tracts     313 non-null    float64 
dtypes: float64(10), geometry(1), int64(4), object(19)
memory usage: 85.6+ KB
In [125]:
#Plot
the_plot = gdf_Tract_utm.plot(
    color='lightgrey',
    edgecolor='grey',
    alpha=0.4,
    figsize=(12,12))

gdf_JoinedData.plot(
    ax=the_plot,
    column='SVI_ev',
    markersize=60,
    edgecolor='white');

ctx.add_basemap(
    ax=the_plot, 
    crs=gdf_Tract_utm.crs,
    source=ctx.providers.CartoDB.Voyager)
In [126]:
#Make some graphical plots
ax=pd.DataFrame(gdf_JoinedData).boxplot(
    column='SVI_ev',
    by='City',
    rot=45,
    figsize=(19,5));

Part 3. Share results¶

With this document we now have a fully reproducible analytic workflow, complete with visualizations of our results. We can export this notebook as an HTML document and share that, or we can commit the document to our GitHub account and share a link to the notebook.

3.1: Sharing your notebook¶

  • 3.1.1

  • Export your completed notebook as an HTML document.

  • View it in a web browser

  • 3.1.2

  • Commit the changes to your forked repository

  • Navigate to https://nbviewer.jupyter.org/ and paste your repository's URL

  • Share this link with others...

We can also save the resulting geodataframes as feature classes for more analysis, either in Python or in ArcGIS Pro.

3.2: Exporting the geodataframe¶

We can export our gdf_JoinedData geodataframe easily, either as a shapefile or a CSV file.

  • 3.2.1 Export the final geodataframe to a feature class
In [127]:
#Export the geodataframe to a shapefile
gdf_JoinedData.to_file(
    filename='../data/EV_sites_with_data.shp',
    driver='ESRI Shapefile',
    index=False
)
C:\Users\jo158\AppData\Local\Temp\ipykernel_9068\3809339296.py:2: UserWarning: Column names longer than 10 characters will be truncated when saved to ESRI Shapefile.
  gdf_JoinedData.to_file(
  • 3.2.2 Export the final geodataframe to a csv file
In [128]:
#Export the geodataframe to a CSV file
pd.DataFrame(gdf_JoinedData).to_csv(
    '../data/my_saved_data.csv',
    index=False
)
In [ ]: